Australian  Government 
Department  of  Defence 

Defence  Science  and 
Technology  Organisation 


A  Confidence  Estimator  for  Speaker  Verification 
Using  Dual  DET  Curves 

T.  C.  Tao 

Command,  Control,  Communications  and  Intelligence  Division 
Defence  Science  and  Technology  Organisation 

DSTO-RR-0358


ABSTRACT 

In speaker verification, the result of a trial is traditionally summarised as an
arbitrary score, where a higher score indicates stronger evidence in favour of
the speaker hypothesis. However, such a score is difficult to interpret. It is useful
to convert it into a “confidence level”, i.e. the posterior probability that
the speaker hypothesis is correct, given the score. One of the simplest ways
to obtain a confidence level is to use a logistic curve, but this requires the
assumption that the true and impostor speaker scores are distributed according
to a Normal distribution. In this report I propose a new formula, called the
dual Detection Error Trade-Off (DET) curve, since it represents the same
information as a DET curve. This formula avoids the assumption of normally
distributed target and impostor scores. Experiments on the NIST 99 data
show that the dual DET curve performs slightly better than the logistic curve.


APPROVED  FOR  PUBLIC  RELEASE 




Published  by 

DSTO  Defence  Science  and  Technology  Organisation 
PO  Box  1500 

Edinburgh,  South  Australia  5111,  Australia 

Telephone:  (08)  8259  5555 
Facsimile:  (08)  8259  6567 

©  Commonwealth  of  Australia  2010 
AR  No.  AR-014-858 
October,  2010 




A Confidence Estimator for Speaker Verification Using Dual DET Curves

Executive  Summary 

Speaker recognition is the problem of identifying people from their voices, and has important
applications. For instance, it is often desirable to determine (or, in the case of Speaker
Verification, to confirm) the identity of a speaker in applications such as telephone calls.
The human voice is unique, or at least very difficult to mimic successfully, so the ability
to identify a speaker by his or her voice offers better security than some other means of
identification such as passwords (which can be forgotten or compromised) or physical
objects (which can be stolen).

In  speaker  verification,  the  result  of  a  trial  is  traditionally  summarised  as  an  arbitrary 
score,  where  a  higher  score  indicates  stronger  evidence  in  favour  of  the  speaker  hypothesis. 
However  this  is  difficult  to  interpret.  It  is  useful  to  convert  this  score  into  a  “confidence 
level”,  i.e.  the  a-posteriori  probability  that  the  speaker  hypothesis  is  correct,  given  the 
score.  Ideally,  the  confidence  level  should  be  100%  for  a  true-speaker  trial  and  0%  for  an 
impostor  trial.  This  can  only  occur  when  the  score  itself  is  ideal,  e.g.  the  score  is  always 
above  a  threshold  for  true  speaker  trials  and  below  the  same  threshold  for  impostor  trials. 
In  practice,  the  scores  are  never  ideal,  so  it  is  impossible  to  obtain  ideal  confidence  levels. 

One of the simplest ways to obtain a confidence level is to use a logistic curve, but this
requires the assumption that the true and impostor speaker scores are distributed according to
a Normal distribution. In this report I propose a new formula, called the dual Detection
Error Trade-Off (DET) curve, since it represents the same information as a DET curve.
This formula avoids the assumption of normally distributed target and impostor scores.

The  quality  of  the  confidences  can  be  summarised  using  a  metric  known  as  the  Normalised 
Cross  Entropy (NCE).  The  NCE  is  maximised  when  the  confidence  level  is  equal  to  the 
a-posteriori  probability  of  the  speaker  hypothesis,  given  the  score. 

Experiments  were  performed  on  the  NIST  99  data.  Two  speaker  verification  systems  were 
used  to  test  the  dual  DET  curve  by  comparing  it  against  the  logistic  curve.  The  dual  DET 
curve  performed  better  than  the  logistic  curve  in  adverse  conditions  (bad  signal-to-noise 
ratio  or  short  utterance  length),  but  slightly  worse  in  ideal  conditions. 




Author 


Trevor  C  Tao 

CCCID 

Trevor Chi-Yuen Tao graduated from the University of Adelaide
(Australia) in 2005 with a PhD in Applied Mathematics
and started employment at DSTO in August 2006. His current
research involves speech processing.




Contents 

Glossary

1  Introduction

2  Existing Confidence Estimators

3  New Confidence Metric

3.1  Ideal dual DET curve

3.2  Non-ideal dual DET curves

4  Relationship Between Confidence and Validation Data

5  Confidence Evaluation

6  Experiment

7  Summary and Conclusion

References

Appendices

A  Derivation of the Logistic Curve

B  Derivation of NCE




Figures 

1  DET curve and dual, n = 1.0

2  DET curve and dual, n = 1.1

3  DET curve and dual, n = 6.0

4  DET curve and dual, piecewise linear

5  Dual DET curve versus Logistic curve

6  DET curve, System A

7  DET curve, System A (20 speakers)

8  DET curve, System B




Tables 

1  Dual DET Curve vs Logistic (System A, proper NIST evaluation)

2  Dual DET Curve vs Logistic (System A, 20 speakers only)

3  Dual DET Curve vs Logistic (System B, proper NIST evaluation)




Glossary 


DCF  Detection  Cost  Function 
DET  Detection  Error  Trade-off 
FAR  False  Alarm  Rate 
LLR  Log  Likelihood  Ratio 
MR  Miss  Rate 

NIST  National  Institute  of  Standards  and  Technology 
SNR  Signal-to-Noise  Ratio 




1  Introduction 


Speaker recognition is the problem of identifying people from their voices, and has important
applications. For instance, it is often desirable to determine (or, in the case of Speaker
Verification, to confirm) the identity of a speaker in applications such as telephone calls.
The human voice is unique, or at least very difficult to mimic successfully, so the ability
to identify a speaker by his or her voice offers better security than some other means of
identification such as passwords (which can be forgotten or compromised) or physical
objects (which can be stolen).

Given  an  audio  file  and  a  claimed  speaker  identity,  a  Speaker  Verification  system  typically 
outputs  a  score,  summarising  the  evidence  in  favour  of  the  speaker  hypothesis.  One 
of  the  main  difficulties  with  Speaker  Verification  systems  is  that  the  score  cannot  be 
easily  interpreted  -  it  is  merely  an  arbitrary  number  where  higher  scores  indicate  stronger 
evidence  in  favour  of  the  speaker  hypothesis.  In  some  cases,  the  score  may  have  a  specific 
meaning  such  as  a  log  likelihood  ratio  (LLR)  between  target  and  background  model,  but 
in other situations there is no such interpretation available. For instance, if T-norm [1] is
applied  to  an  LLR,  the  new  score  can  no  longer  be  interpreted  as  an  LLR. 

The concept of confidence is being increasingly adopted in Speaker Verification systems [2,
3, 4, 5, 6]. Roughly speaking, a confidence level is intended to complement the system
output by indicating whether the results are reliable. For instance, one can say that the LLR
is approximately 2.5 with 85% confidence, or that the speaker should be rejected with 90%
confidence.

The report is organised as follows: Section 2 discusses existing confidence estimators.
Section 3 proposes a new estimator. Section 4 discusses the relationship between confidence
estimators and the validation data set. Section 5 discusses evaluation of confidence
estimators. Section 6 discusses an experiment comparing the performance of the proposed
confidence estimator with that of an existing confidence estimator. Section 7 is the
conclusion, summarising the report’s findings.


2  Existing  Confidence  Estimators 


One of the main difficulties with confidence estimators is that there are many interpretations
of the word confidence. For instance, higher confidence can reflect stronger evidence
in favour of the speaker hypothesis or of the “right decision” hypothesis (i.e. the system
made the right decision to accept or reject the speaker [5]). Confidence can also reflect
whether an LLR lies within a specified interval [2] (borrowing the idea from confidence
intervals in statistics). In this report I restrict my attention to confidence estimators that
are interpretable as a probability conf = $p(H_1 \mid E)$, where $H_1$ is the speaker hypothesis
and $E$ is some evidence, such as LLR score, channel type, signal-to-noise ratio (SNR) or
other information. In other words, the confidence is an estimate of the probability that
the purported speaker is the actual speaker given the available evidence. In this report, I
assume that the only available evidence is the LLR score.

One of the simplest confidence estimators uses Gaussian distributions to model the LLR
scores from true-speaker and impostor trials [7]. Assuming that the prior probabilities
$\pi_0, \pi_1$ of the null and speaker hypotheses are known, the confidence can be estimated
using Bayes' law:

$$p(H_1 \mid E) = \frac{\pi_1\, p(E \mid H_1)}{\pi_0\, p(E \mid H_0) + \pi_1\, p(E \mid H_1)} \qquad (1)$$

Assuming that $p(E \mid H_0)$ and $p(E \mid H_1)$ are normally distributed with the same variance,
this reduces to a logistic function of the score $s$:

$$p(H_1 \mid s) = \frac{1}{1 + \exp(-\beta_0 - \beta_1 s)} \qquad (2)$$

For the derivation of this see Appendix A. The parameters $\beta_0, \beta_1$ can be estimated via
regression. Note that this bypasses the requirement of knowing the priors.
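
The reduction from Bayes' law to the logistic form (2) can be checked numerically. The sketch below is illustrative only: it uses the closed-form coefficients implied by two equal-variance Gaussians rather than the regression fit mentioned above, and the function names and parameter values are hypothetical.

```python
import math

def gaussian_pdf(x, mean, sigma):
    """Density of a Normal(mean, sigma^2) distribution at x."""
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def bayes_confidence(s, mu0, mu1, sigma, pi1):
    """Posterior p(H1 | s) from Bayes' law with Gaussian class densities."""
    pi0 = 1.0 - pi1
    numerator = pi1 * gaussian_pdf(s, mu1, sigma)
    return numerator / (pi0 * gaussian_pdf(s, mu0, sigma) + numerator)

def logistic_params(mu0, mu1, sigma, pi1):
    """Closed-form (beta0, beta1) implied by the equal-variance Gaussian model."""
    beta1 = (mu1 - mu0) / sigma ** 2
    beta0 = math.log(pi1 / (1.0 - pi1)) - (mu1 ** 2 - mu0 ** 2) / (2.0 * sigma ** 2)
    return beta0, beta1

def logistic_confidence(s, beta0, beta1):
    """Logistic curve 1 / (1 + exp(-beta0 - beta1 * s))."""
    return 1.0 / (1.0 + math.exp(-beta0 - beta1 * s))

# Assumed (illustrative) impostor mean, target mean, common sigma and prior.
mu0, mu1, sigma, pi1 = -1.0, 2.0, 1.5, 0.3
beta0, beta1 = logistic_params(mu0, mu1, sigma, pi1)
for s in (-3.0, 0.0, 0.5, 4.0):
    print(s, bayes_confidence(s, mu0, mu1, sigma, pi1), logistic_confidence(s, beta0, beta1))
```

The two columns agree to rounding error. In practice the coefficients would be fitted by regression on validation scores, which absorbs the priors into $\beta_0$.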


A well-known disadvantage of this approach is that the score distributions are generally
not Gaussian (although this does not necessarily lead to poor results on actual data [8]).
There are many other, more complex confidence estimators available. It is beyond the
scope of this report to discuss these. I refer the interested reader to [5].


3  New  Confidence  Metric 

3.1  Ideal  dual  DET  curve 


I propose a new way of measuring confidence. The key observation is that a detection
cost function (DCF) of the form cost = $\alpha\,\mathrm{MR} + (1-\alpha)\,\mathrm{FAR}$ can be associated with a
confidence level of $100(1-\alpha)\%$ in the following sense: given the above DCF, accepting the
speaker hypothesis would cost more in the long run than rejecting it, unless one was at least
$100(1-\alpha)\%$ confident of the speaker hypothesis. Indeed, if $q$ denotes the probability that
the speaker hypothesis is true, accepting incurs an expected cost of $(1-\alpha)(1-q)$ and
rejecting incurs $\alpha q$, so accepting is cheaper exactly when $q > 1-\alpha$. For any confidence
level $100(1-\alpha)\%$ and corresponding DCF cost = $\alpha\,\mathrm{MR} + (1-\alpha)\,\mathrm{FAR}$ one can calculate
the optimal threshold $\theta_{1-\alpha}$. This yields a function from confidence to score. Assuming this
function is strictly monotonically increasing and hence invertible, one can obtain a function
from score to confidence. It turns out this function is “orthogonal” to the Detection Error
Trade-off (DET) curve in the sense of representing the same information in a different manner.
In other words, one can obtain the confidence function from the DET curve and vice versa.
Hence I call this function the dual DET curve. To make this mathematically precise,
consider an “ideal” DET curve satisfying the following properties:




•  For any false alarm rate FAR the miss rate is given by MR = $f(\mathrm{FAR})$ for some
function $f : [0, 1] \to [0, 1]$.

•  The function $f$ is monotonically decreasing (i.e. $f(x) > f(y)$ if and only if $x < y$)
with $f(0) = 1$, $f(1) = 0$.

•  The function $f$ is smooth and strictly convex (i.e. its first derivative is strictly increasing)
with $f'(0) = -\infty$, $f'(1) = 0$.


The first two properties are derived by considering the possible trade-off in a DET curve
by varying the threshold: a higher false alarm rate implies a smaller miss rate (and vice
versa), with the extreme values $f(0) = 1$, $f(1) = 0$ corresponding to a threshold of plus or
minus infinity. The third property implies a unique confidence level for any false alarm
rate. Substituting MR = $f(\mathrm{FAR})$ into the DCF and setting its derivative with respect to FAR
to zero shows that $\alpha\,\mathrm{MR} + (1-\alpha)\,\mathrm{FAR}$ is minimised
if and only if $f'(\mathrm{FAR}) = -(1-\alpha)/\alpha$. Hence the confidence is given by

$$q(\mathrm{FAR}) = 1 - \alpha = \frac{-f'(\mathrm{FAR})}{1 - f'(\mathrm{FAR})} \qquad (3)$$

Conversely, equation (3) implies that $f'(\mathrm{FAR}) = -q(\mathrm{FAR})/(1 - q(\mathrm{FAR}))$ and thus

$$f(\mathrm{FAR}) = 1 - \int_0^{\mathrm{FAR}} \frac{q(t)}{1 - q(t)}\,dt \qquad (4)$$

Hence $q$ can be obtained from $f$ and vice versa.
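
The round trip between equations (3) and (4) can be sketched numerically. The example below is an illustration under stated assumptions, not the procedure used in the experiments: it takes the curve $(1-\mathrm{FAR})^2 + (1-\mathrm{MR})^2 = 1$ as a test case, estimates $f'$ by a central difference, and integrates from FAR = 0.05 rather than from 0 because $f'(0) = -\infty$ makes the integrand unbounded at the origin. All function names are hypothetical.

```python
def det_curve(far, n=2.0):
    """Example DET curve MR = f(FAR) defined by (1-FAR)^n + (1-MR)^n = 1."""
    return 1.0 - (1.0 - (1.0 - far) ** n) ** (1.0 / n)

def dual_from_det(f, far, h=1e-6):
    """Equation (3): q = -f'/(1 - f'), with f' from a central difference."""
    slope = (f(far + h) - f(far - h)) / (2.0 * h)
    return -slope / (1.0 - slope)

def det_from_dual(q, f_start, a, b, steps=20000):
    """Equation (4) restarted at FAR = a:
    f(b) = f(a) - integral from a to b of q/(1-q), by the trapezoidal rule."""
    def odds(t):
        return q(t) / (1.0 - q(t))
    width = (b - a) / steps
    total = 0.5 * (odds(a) + odds(b))
    for k in range(1, steps):
        total += odds(a + k * width)
    return f_start - total * width

def q(far):
    """Dual DET curve of the example, as a function of FAR."""
    return dual_from_det(det_curve, far)

recovered = det_from_dual(q, det_curve(0.05), 0.05, 0.5)
print(det_curve(0.5), recovered)  # the two values should agree closely
```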

Figures 1 to 4 show four examples of DET curves¹ and the corresponding duals. The
first three are unit balls of the form $(1-\mathrm{FAR})^n + (1-\mathrm{MR})^n = 1$ for $n = 1.0, 1.1, 6.0$.
In Figure 1, the DET curve is a perfect straight line indicating chance performance, i.e.
the true and impostor speaker score distributions are identical. As expected there is no
discriminating capability, since the confidence is 50% for any FAR. Figure 2 shows a DET
curve not far away from chance performance and, as expected, the confidence is very poor. Even for
extreme values of FAR the confidence is always between 0.3 and 0.7. In Figure 3 much
better discriminability is observed. For instance, extreme values of FAR yield a confidence
near 0 or 1, indicating near certainty of a decision (reject for 0, accept for 1). This is
clarified in Figure 4, where the DET curve is piecewise linear instead of a unit ball. In this
case, the confidence is 90% for FAR < 0.1 and 10% for FAR > 0.1.
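
The example curves can be reproduced in closed form. The sketch below is illustrative: the slope of the unit-ball curve is obtained by implicit differentiation and inserted into equation (3), and the piecewise-linear curve is assumed to have its knee at (FAR, MR) = (0.1, 0.1), an assumption that is consistent with the 90% and 10% values quoted above. The function names are hypothetical.

```python
def unit_ball_confidence(far, n):
    """Dual of the DET curve (1-FAR)^n + (1-MR)^n = 1, valid for 0 < FAR < 1.

    Implicit differentiation gives f'(FAR) = -((1-FAR)/(1-MR))^(n-1),
    which is inserted into q = -f'/(1 - f')."""
    one_minus_mr = (1.0 - (1.0 - far) ** n) ** (1.0 / n)
    slope = -((1.0 - far) / one_minus_mr) ** (n - 1.0)
    return -slope / (1.0 - slope)

def piecewise_linear_confidence(far, knee=(0.1, 0.1)):
    """Dual of a two-segment DET curve through (0, 1), the knee and (1, 0)."""
    x, y = knee
    slope = (y - 1.0) / x if far < x else -y / (1.0 - x)
    return -slope / (1.0 - slope)

for n in (1.0, 1.1, 6.0):
    print(n, [round(unit_ball_confidence(far, n), 3) for far in (0.01, 0.5, 0.99)])
print([round(piecewise_linear_confidence(far), 3) for far in (0.05, 0.5)])
```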

The above discussion relates a function from FAR to MR with a function from FAR to
confidence. In practice one is interested in a function from score to MR or confidence.
Unfortunately, it is not easy to relate the score to FAR. In fact, this information is
unavailable in a standard DET plot. However, since the FAR is a decreasing (and hence
invertible) function of score, it is reasonable to consider the confidence function as being
a dual to the DET curve.

¹ Not all example DET curves are ideal, but the results given can be obtained by approximation, e.g.
treating the DET curve as a limit of a sequence of ideal DET curves.

Figure  4  shows  that  for  a  good  DET  curve  (near  the  bottom  left  corner),  the  confidence 
will  be  near  1  for  scores  above  the  threshold  and  near  zero  for  scores  below  (recall  that 
FAR  is  a  decreasing  function  of  score,  so  the  effect  is  to  flip  the  right  diagram  of  Figure  4 
horizontally).  Thus  in  the  ideal  case,  both  the  DET  curve  and  dual  DET  curve  will  have 
their  “energy”  concentrated  at  zero,  i.e.  MR  and  confidence  are  both  near  unity  when 
FAR  is  near  zero,  or  both  near  zero  when  FAR  is  away  from  zero. 


3.2  Non-ideal  dual  DET  curves 


In  practice  one  does  not  work  with  an  ideal  DET  curve  and  corresponding  dual  curve. 
The  DET  curve  is  calculated  on  some  finite-size  dataset  so  it  is  approximated  by  a  zigzag 
sequence  of  horizontal  and  vertical  lines.  Moreover,  the  DET  curve  is  not  guaranteed  to 
be  convex,  even  if  we  were  allowed  to  approximate  the  curve  (by  replacing  the  horizontal 
and  vertical  lines  with  a  smooth  curve).  The  non-convexity  of  the  DET  curve  implies  that 
there  is  not  necessarily  a  1-1  correspondence  between  confidence  level  and  LLR  score.  In 
other words, given a confidence level $100(1-\alpha)\%$, there may be several thresholds that
minimise the DCF cost = $\alpha\,\mathrm{MR} + (1-\alpha)\,\mathrm{FAR}$.

Thus one must approximate the dual DET curve as a piecewise smooth function, where
specified confidence levels are chosen. If one chose, say, multiples of 5% confidence, the dual
DET curve would be piecewise linear with “nodes” $(\theta_{0.05}, 0.05), (\theta_{0.10}, 0.10), \ldots, (\theta_{0.95}, 0.95)$.
For finer granularity one could use multiples of 1% confidence instead of 5%
confidence. One must also have a rule for choosing between multiple thresholds corresponding
to a particular confidence level. This is discussed in Section 6.

Note that care is taken to avoid a “hard” confidence level of 100% or 0%. In
any case, the nature of statistical testing implies that one never expects a confidence of
100% or 0%. Moreover, the Normalised Cross Entropy (NCE) measure severely penalises
hard confidence levels if they turn out to be wrong. This is described below in Section 5.
In the case of the dual DET curve, the simplest solution is to set upper and lower bounds
for confidences: any confidence level above $q_{\max}$ or below $q_{\min}$ is adjusted to $q_{\max}$ or
$q_{\min}$ respectively, for some $0 < q_{\min} < q_{\max} < 1$.
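
A minimal sketch of evaluating such a piecewise-linear dual DET curve with clamping is given below. It assumes the node thresholds are strictly increasing (ties arising from a non-convex DET curve would have to be resolved first); the node values, the bounds 0.01 and 0.99, and the function name are all hypothetical.

```python
from bisect import bisect_right

def dual_det_confidence(score, thresholds, levels, q_min=0.01, q_max=0.99):
    """Piecewise-linear confidence through the nodes (thresholds[i], levels[i]).

    thresholds must be strictly increasing. Scores outside the node range take
    the nearest end value, and the result is clamped to [q_min, q_max]."""
    if score <= thresholds[0]:
        q = levels[0]
    elif score >= thresholds[-1]:
        q = levels[-1]
    else:
        i = bisect_right(thresholds, score) - 1
        t0, t1 = thresholds[i], thresholds[i + 1]
        frac = (score - t0) / (t1 - t0)
        q = levels[i] + frac * (levels[i + 1] - levels[i])
    return min(q_max, max(q_min, q))

# Hypothetical nodes, chosen only to exercise interpolation and clamping.
thresholds = [-3.0, -1.0, 0.0, 1.0, 3.0]
levels = [0.0, 0.25, 0.5, 0.75, 1.0]
print(dual_det_confidence(0.5, thresholds, levels))    # interpolated between nodes
print(dual_det_confidence(10.0, thresholds, levels))   # clamped to q_max
print(dual_det_confidence(-10.0, thresholds, levels))  # clamped to q_min
```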


4  Relationship Between Confidence and Validation Data


Recall that the purpose of a confidence measure is to convert some evidence (such as an
LLR score) into an estimate of how confident we are that the hypothesised speaker is
the real speaker. The latter essentially depends on results obtained from some “previous”
validation data². For instance, an LLR of -2.5 from a particular system may seem low, since
the score is negative. Indeed, if the validation data was such that most of the targets
scored positive and most impostors scored around zero, then an LLR of -2.5 indicates
strong evidence in favour of the non-speaker hypothesis. But if the target and impostor
scores were clustered near -2.0 and -5.0 respectively, then the LLR of -2.5 is in fact “high”.
Thus, roughly speaking, the confidence is a function of the actual LLR and some statistics
of the LLR scores obtained on a previous dataset.
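
The dependence on validation statistics can be made concrete with a small sketch. It is only an illustration: it assumes equal priors and unit-variance Gaussian score distributions, and it takes +3.0 as a stand-in for "targets scored positive"; none of these values come from the experiments, and the function name is hypothetical.

```python
import math

def gaussian_confidence(llr, target_mean, impostor_mean, sigma=1.0):
    """p(H1 | llr) for equal priors and equal-variance Gaussian score models."""
    log_odds = ((llr - impostor_mean) ** 2 - (llr - target_mean) ** 2) / (2.0 * sigma ** 2)
    return 1.0 / (1.0 + math.exp(-log_odds))

# Validation set 1 (assumed): targets near +3.0, impostors near 0.0.
low = gaussian_confidence(-2.5, target_mean=3.0, impostor_mean=0.0)
# Validation set 2: targets near -2.0, impostors near -5.0.
high = gaussian_confidence(-2.5, target_mean=-2.0, impostor_mean=-5.0)
print(low, high)  # the same LLR of -2.5 yields very different confidences
```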

The essential point is that one cannot obtain the confidence from an LLR score per se.
The conversion can only take place in the context of some “previous” validation data as
alluded to above. This dataset must therefore be homogeneous with respect to the LLR, in
the sense that similar scores are generated. Ideally, the previous dataset will be generated
by the same speaker(s) as those being tested, under similar conditions regarding noise,
channel, SNR, environment, and so on. As is well known, it is not trivial to obtain a
sufficient amount of homogeneous data in real-world scenarios, but discussion of this is
beyond the scope of this report.


5  Confidence  Evaluation 


It is important to have some method of confidence evaluation, i.e. a metric to compare
different confidence estimators. One can think of this as a “meta-confidence”, i.e. how
confident we are that the confidence measure is good. A confidence estimator is good if it
consistently returns high confidences for true-speaker cuts and low confidences for impostor
cuts.

To express this in mathematics, let $q(E)$ denote a function that estimates a confidence
level given the evidence $E$. Let $M(q)$ denote the quality of the confidence estimator. Thus
if the Normalised Cross Entropy (see below) is used as the quality measure $M$, then the
NCE would reflect the quality of the confidences in the same way the DET curve reflects
the quality of the LLR scores.

The ideal confidence estimator would be a function $q(E)$ that returns 1 when the speaker
hypothesis is correct and 0 otherwise. This ideal estimator corresponds to perfect knowledge
given the evidence, i.e. $p(H_1 \mid E) = 1$ when the speaker hypothesis is correct, and
$p(H_1 \mid E) = 0$ when the null hypothesis is correct. The closer the actual confidence estimator
$q$ is to the ideal, the better (i.e. higher) the estimate $M(q)$.

There are many choices of function $M(q)$. I choose the well-known Normalised Cross
Entropy (NCE) measure. The NCE is given by

$$\mathrm{NCE}(q) = \frac{H(\omega) - A}{H(\omega)} \qquad (5)$$

where $\omega$ is a random variable representing the true identity ($\omega$ = accept with probability
$\pi_1$, reject with probability $\pi_0 = 1 - \pi_1$), $H(\omega) = -\pi_0 \log \pi_0 - \pi_1 \log \pi_1$ is the entropy of
$\omega$, and $A = E[\,H_1\ ?\ -\log q(E) : -\log(1 - q(E))\,]$, where $a\ ?\ b : c$ is a shorthand for $b$ if $a$ is true
and $c$ otherwise³, and $E[\cdot]$ denotes expectation with respect to $p(E, \omega)$, treating the true
identity and evidence as random variables (in particular, when the only evidence taken
into account is the score, then $E$ is a random variable over the real line). All logarithms
are taken to the same base; the NCE, being a ratio, does not depend on the choice.

² This point will be obvious to experienced researchers; the discussion is mainly intended for the layman.

To maximise NCE, we need to minimise the term $A$, since $H(\omega)$ does not depend on $q$.
One can show that $A = E[\,H_1\ ?\ -\log q(E) : -\log(1 - q(E))\,]$ reduces to

$$A = \int_E \left[ -\pi_0\, p(E \mid H_0) \log(1 - q(E)) - \pi_1\, p(E \mid H_1) \log q(E) \right] dE. \qquad (6)$$

One can also show that $A$ is minimised (and hence NCE maximised) when

$$q(E) = p(H_1 \mid E). \qquad (7)$$

Thus the NCE has a desirable property: the confidence estimator is encouraged to reflect
the probability of the target hypothesis given the evidence. A metric with this property
is called a proper scoring rule [3].

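When evaluated on a dataset, the term $A$ in the NCE is approximated by

$$A = -\pi_0 \frac{1}{N_f} \sum_{i=1}^{N_f} \log(1 - q(E_i)) - \pi_1 \frac{1}{N_t} \sum_{i=1}^{N_t} \log q(E_i), \qquad (8)$$

where $N_f, N_t$ denote the number of impostor and target trials; the first sum runs over the
impostor trials and the second over the target trials.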

The  proof  of  equations  (6)-(8)  is  shown  in  Appendix  B. 

Equations (5) and (6) imply the NCE has an upper bound of 1, attained only by the
ideal estimator. In practice, the ideal estimator is generally not obtainable. For
instance, if the dataset had one target and one impostor trial for which the system outputs
the same evidence, then it is impossible to have $q(E) = 1$ for the target trial and $q(E) = 0$
for the impostor trial, since the confidence $q$ can only be a function of the evidence. The
ideal estimator could only occur if the evidence itself were “ideal”, e.g. the LLR is always
above a threshold for targets and below the same threshold for impostors.

An NCE of zero indicates “baseline” performance where the only evidence taken into account
is the prior probabilities ($\pi_0, \pi_1$). Of course it is possible to have a poorly performing
confidence estimator with NCE = 0 that uses evidence other than the prior. It is also
possible for the NCE to be arbitrarily large and negative, indicating very bad performance.
This occurs, for example, if the confidence estimator returns extreme values in the wrong
direction (near 0 for true speakers or 1 for impostors).
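
The behaviour described above can be illustrated with a short sketch of the dataset form of the NCE. It assumes the definition NCE = (H − A)/H with A estimated as in equation (8), uses natural logarithms (the base cancels in the ratio), and requires every confidence to lie strictly between 0 and 1. The function name and the example confidences are hypothetical.

```python
import math

def nce(target_conf, impostor_conf, pi1):
    """Normalised Cross Entropy, (H - A) / H, from per-trial confidences.

    target_conf / impostor_conf: confidences q(E_i) on target / impostor trials,
    each strictly between 0 and 1.
    pi1: prior probability of the speaker hypothesis (0 < pi1 < 1)."""
    pi0 = 1.0 - pi1
    entropy = -pi0 * math.log(pi0) - pi1 * math.log(pi1)
    a_impostor = -sum(math.log(1.0 - q) for q in impostor_conf) / len(impostor_conf)
    a_target = -sum(math.log(q) for q in target_conf) / len(target_conf)
    return (entropy - (pi0 * a_impostor + pi1 * a_target)) / entropy

print(nce([0.3, 0.3], [0.3, 0.3, 0.3], pi1=0.3))  # prior-only estimator: baseline
print(nce([0.9, 0.8], [0.1, 0.2], pi1=0.5))       # informative estimator
print(nce([0.01], [0.99], pi1=0.5))               # confidently wrong: negative
```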

³ This notation is derived from programming languages such as C++ and Java.


6  Experiment 

We  compare  the  performance  of  the  dual  DET  curve  (Section  3)  with  that  of  the  logistic 
curve  (described  in  Section  2). 

The dual DET curve is approximated as a piecewise linear function using nodal points
$(\theta_q, q)$, where $\theta_q$ is the threshold corresponding to confidence level $q$ and

$$q \in \{0.01, 0.05, 0.10, 0.15, \ldots, 0.95, 0.99\}. \qquad (9)$$

Thus the nodes of the dual DET curve occur when $q$ = 0.01 or 0.99 or a multiple of 0.05.
I found this to be a reasonable compromise between accuracy and complexity. If there are
multiple thresholds corresponding to the minimum of $\alpha\,\mathrm{MR} + (1-\alpha)\,\mathrm{FAR}$ I simply chose
the smallest threshold. I found experimentally that this did not cause significant errors.
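
The node estimation just described can be sketched as a brute-force search. This is an illustration rather than the implementation used in the experiments: it assumes a trial is accepted when its score is at or above the threshold, takes the candidate thresholds to be the observed scores plus one value above the maximum, and computes MR and FAR as rates so that both classes carry equal weight. The function names and the toy scores are hypothetical.

```python
def estimate_threshold(target_scores, impostor_scores, q):
    """Smallest threshold minimising (1-q)*MR + q*FAR, i.e. alpha = 1 - q.

    A trial is accepted when its score is >= the threshold."""
    n_t, n_f = len(target_scores), len(impostor_scores)
    # Candidate thresholds: every observed score, plus one above the maximum
    # (which rejects every trial).
    scores = sorted(set(target_scores) | set(impostor_scores))
    candidates = scores + [scores[-1] + 1.0]
    best_cost, best_threshold = None, None
    for threshold in candidates:  # ascending order, so ties keep the smallest
        miss_rate = sum(s < threshold for s in target_scores) / n_t
        false_alarm_rate = sum(s >= threshold for s in impostor_scores) / n_f
        cost = (1.0 - q) * miss_rate + q * false_alarm_rate
        if best_cost is None or cost < best_cost:
            best_cost, best_threshold = cost, threshold
    return best_threshold

def dual_det_nodes(target_scores, impostor_scores, levels):
    """Nodes (theta_q, q) of the piecewise-linear dual DET curve."""
    return [(estimate_threshold(target_scores, impostor_scores, q), q) for q in levels]

targets = [2.0, 3.0, 4.0]
impostors = [0.0, 1.0, 2.5]
print(dual_det_nodes(targets, impostors, [0.1, 0.5, 0.9]))
```

The search is quadratic in the number of scores; sorting the scores once and sweeping cumulative counts would give the same thresholds more efficiently on large datasets.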


There  is  an  important  subtlety  in  estimating  the  parameters  of  the  dual  DET  curve:  the 
number  of  true  and  impostor  scores  must  be  assumed  equal.  Otherwise  one  would  have  to 
apply  linear  weighting  to  compensate.  For  instance,  if  there  were  20,000  impostor  scores 
but  only  200  true  speaker  scores,  one  must  pretend  that  each  true  speaker  score  occurs 
hundredfold  instead  of  once.  It  is  not  necessary  to  enlarge  the  actual  list  of  scores  since 
the  same  effect  can  be  achieved  by  using  the  following  DCF: 


$$\text{cost} = \frac{\alpha\,\mathrm{MR}}{N_f} + \frac{(1-\alpha)\,\mathrm{FAR}}{N_t} \qquad (10)$$


where  Nt  and  Nf  are  the  number  of  true  and  impostor  trials  respectively.  An  example 
of  the  dual  DET  curve  versus  logistic  function  is  shown  in  Figure  5.  This  shows  that  the 
dual  DET  curve  is  a  reasonable  estimator  of  confidence  as  a  function  of  score,  in  that  it 
closely  approximates  the  logistic  curve. 

We used the NIST 98 and 99 male datasets for development and evaluation data respectively.
The development data was used to estimate the parameters of the logistic curve and
dual DET curve. The evaluation data was used to evaluate the quality of the confidence
estimator.

We tested two systems, which are referred to as Systems A and B. System A was also tested
in a more realistic scenario where 20 speakers were tested against each audio file and the
set of speakers was identical for each audio file (this differs from the usual basic single
speaker recognition task, where a different set of 11 speakers was tested for each audio file
and the correct speaker was usually among the 11 speakers tested). A consequence is that
the 20-speaker scenario has many more impostor trials than the standard one, because the
proper NIST specification was designed so that most cuts had the correct speaker among
the 11 claimed speakers.




We tested nine conditions: (1) all trials; (2) only those with long utterance length; (3) only
those with medium utterance length; (4) only those with short utterance length; (5) all*
trials⁴, where the length is determined and the logistic or dual DET curve corresponding
to the appropriate level (GOOD, OKAY or BAD) is chosen. Conditions (6)-(9) are the same as
(2)-(5) but using SNR instead of utterance length. For each condition, the NCE is recorded for
both the dual DET and logistic curve. The definition of good, okay and bad is arbitrary.
I defined them by sorting the data according to utterance length and dividing them into
three sets of equal size. The NCE results of System A (proper NIST evaluation), System A (20
speakers only) and System B are given in Tables 1, 2 and 3 respectively. The corresponding
DET curves are given in Figures 6, 7 and 8.
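
The three-way split can be sketched as follows. This is an illustration only: it assumes that the longest utterances form the GOOD set and the shortest the BAD set, and the trial records, field names and function name are hypothetical.

```python
def split_into_terciles(trials, key):
    """Sort trials by key (largest first) and split into three near-equal sets."""
    ordered = sorted(trials, key=key, reverse=True)
    n = len(ordered)
    cut1, cut2 = n // 3, (2 * n) // 3
    return {"GOOD": ordered[:cut1], "OK": ordered[cut1:cut2], "BAD": ordered[cut2:]}

# Hypothetical trials with utterance lengths in seconds.
trials = [{"id": i, "length": length} for i, length in enumerate([5, 30, 12, 45, 8, 20])]
groups = split_into_terciles(trials, key=lambda trial: trial["length"])
for name, members in groups.items():
    print(name, [trial["length"] for trial in members])
```

Passing a different key (for example an SNR field) gives the corresponding split for conditions (6)-(9).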


Overall the dual DET curve is slightly superior to the logistic curve. One possible explanation
is that the dual DET curve has more degrees of freedom. More specifically, (9) implies
the dual DET curve has 21 degrees of freedom (for every value of $q$, the parameter $\theta_q$
represents one degree of freedom) whereas the logistic curve has only two.

Table 1: Dual DET Curve vs Logistic (System A, proper NIST evaluation)

CONDITION      NCE (dual DET curve)   NCE (logistic)
ALL            0.574                  0.539
GOOD length    0.697                  0.705
OK length      0.609                  0.596
BAD length     0.413                  0.325
ALL* length    0.575                  0.545
GOOD snr       0.679                  0.691
OK snr         0.609                  0.589
BAD snr        0.435                  0.342
ALL* snr       0.575                  0.542

Table 2: Dual DET Curve vs Logistic (System A, 20 speakers only)

CONDITION      NCE (dual DET curve)   NCE (logistic)
ALL            0.643                  0.624
GOOD length    0.766                  0.765
OK length      0.713                  0.681
BAD length     0.428                  0.377
ALL* length    0.659                  0.632
GOOD snr       0.604                  0.559
OK snr         0.653                  0.614
BAD snr        0.634                  0.613
ALL* snr       0.632                  0.593

⁴ The asterisk is only used to differentiate this from case (1).


Table 3: Dual DET Curve vs Logistic (System B, proper NIST evaluation)

CONDITION      NCE (dual DET curve)   NCE (logistic)
ALL            0.635                  0.608
GOOD length    0.671                  0.668
OK length      0.647                  0.643
BAD length     0.579                  0.516
ALL* length    0.633                  0.609
GOOD snr       0.726                  0.700
OK snr         0.682                  0.675
BAD snr        0.503                  0.471
ALL* snr       0.637                  0.616

7  Summary  and  Conclusion 


The objective is to obtain a confidence level from an LLR score, since the latter is difficult
to interpret. I interpret confidence as the probability of the speaker hypothesis being true,
given the evidence. There are many ways to derive a confidence level given some evidence
(LLR score, channel condition etc). One of the simplest estimators assumes that the only
evidence taken into account is the score. The true and impostor score distributions are
assumed Gaussian, and the confidence is a logistic function of score (this is shown using
Bayes' law). This estimator is used as a baseline.

I proposed a new confidence estimator. It is similar to the logistic curve in that the
only evidence taken into account is the score. The main advantage is that it avoids the
false assumption that true and impostor scores are distributed according to a Gaussian
distribution. The proposed confidence estimator has an interesting property: it represents
the same information as a standard DET curve. Thus confidence and the DET curve are
inherently related. For this reason my confidence estimator is called the dual DET curve.

The dual DET curve is approximated by a piecewise linear function whose nodes correspond
to fixed confidence levels. The “granularity” (number of confidence levels) can be varied
to suit the application. The dual DET curve performs slightly better than the logistic
curve: it attains a higher NCE in every condition of Table 3 (for example, 0.635 versus
0.608 over all trials). This can be attributed to the fact that the dual DET curve has more
degrees of freedom. On the other hand, more data is required to estimate the parameters of
the dual DET curve.
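
The piecewise linear approximation can be illustrated with a minimal Python sketch. It is not the implementation used in the experiments: the node scores, the names NODE_CONFIDENCES, NODE_SCORES and confidence_from_score, and the choice to clamp the output outside the node range are all assumptions made for illustration. In practice the node scores would be estimated from development data.

```python
from bisect import bisect_right

# Hypothetical nodes: fixed confidence levels and the scores at which each
# level is attained. The score values are illustrative only and must increase.
NODE_CONFIDENCES = [0.1, 0.3, 0.5, 0.7, 0.9]
NODE_SCORES = [-0.30, -0.10, 0.00, 0.10, 0.25]


def confidence_from_score(score):
    """Interpolate linearly between nodes; clamp outside the node range."""
    if score <= NODE_SCORES[0]:
        return NODE_CONFIDENCES[0]
    if score >= NODE_SCORES[-1]:
        return NODE_CONFIDENCES[-1]
    # Index of the node at or immediately below the score.
    k = bisect_right(NODE_SCORES, score) - 1
    s0, s1 = NODE_SCORES[k], NODE_SCORES[k + 1]
    c0, c1 = NODE_CONFIDENCES[k], NODE_CONFIDENCES[k + 1]
    return c0 + (c1 - c0) * (score - s0) / (s1 - s0)
```

Increasing the number of nodes gives a finer granularity at the cost of more parameters to estimate, which is the trade-off noted above.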

The confidence is a monotonically increasing function of the score. Thresholding at a
given confidence level (e.g. accepting all trials whose confidence is at least 90%) is
therefore equivalent to thresholding at a corresponding level in the score domain. The
difference is that the former is more meaningful: 90% is interpretable as the probability
of a concrete event (namely the speaker hypothesis being true, given the evidence),
whereas a raw LLR score has no such direct interpretation.
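
The equivalence of the two thresholds can be checked with a short Python sketch. It uses a logistic confidence function only because that function is easy to invert in closed form; the parameters BETA0 and BETA1, the score list and the function names are hypothetical.

```python
import math

# Hypothetical logistic parameters; BETA1 > 0 makes confidence increase with score.
BETA0, BETA1 = 0.5, 8.0


def confidence(score):
    """Logistic confidence as a function of score."""
    return 1.0 / (1.0 + math.exp(-BETA0 - BETA1 * score))


def score_threshold(conf_level):
    """Score at which the confidence equals conf_level (inverse of the logistic)."""
    logit = math.log(conf_level / (1.0 - conf_level))
    return (logit - BETA0) / BETA1


scores = [-0.4, -0.1, 0.0, 0.2, 0.3, 0.5]  # illustrative trial scores
t = score_threshold(0.9)

# Accept by confidence >= 90%, and accept by score >= the equivalent threshold.
by_confidence = [s for s in scores if confidence(s) >= 0.9]
by_score = [s for s in scores if s >= t]
```

Both selections contain the same trials, so the accept/reject decisions do not change. Only the reported quantity becomes interpretable.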




An obvious direction for future research is to study confidence measures that depend
on evidence other than the score, e.g. SNR or channel type. This would raise some
interesting issues. For example, the monotonic relationship between confidence and score
would no longer hold. If the evidence consists of, say, LLR score and SNR, then one cannot,
in general, find thresholds in the LLR-SNR domain that correspond to 90% confidence.


Acknowledgements 

I wish to acknowledge Jeremy Waller and Darryn Smart for helpful discussions on this
report.




References 

1. R. Auckenthaler, M. Carey, H. Lloyd-Thomas, Score normalization for text-independent
speaker verification systems, Digital Signal Processing 10 (1).

2.  R.  Vogt,  S.  Sridharan,  M.  Mason,  Making  confident  speaker  verification  decisions  with 
minimal  speech,  in:  Interspeech  2008,  Brisbane,  Australia,  2008,  pp.  1405-1408. 

3. W. Campbell, D. Reynolds, J. Campbell, K. Brady, Estimating and evaluating confidence
for forensic speaker recognition, in: ICASSP 2005, no. I, 2005, pp. 717-720.

4.  J.  G.  M.  Huggins,  Confidence  metrics  for  speaker  identification,  in:  Proc.  ICSLP,  2002, 
pp.  1381-1384. 

5. J. Richiardi, P. Prodanov, A. Drygajlo, Speaker verification with confidence and
reliability measures, Vol. 1, 2006, pp. 641-644.

6.  J.  Richiardi,  A.  Drygajlo,  P.  Prodanov,  Confidence  and  reliability  measures  in  speaker 
verification,  Journal  of  the  Franklin  Institute  343  (6)  (2006)  574-595. 

7.  H.  Nakasone,  S.  Beck,  Forensic  automatic  speaker  recognition,  in:  Proc.  ISCA  workshop 
on  speaker  recognition  -  2001:  a  speaker  odyssey,  2001. 

8.  S.  Bengio,  C.  Marcel,  S.  Marcel,  J.  Mariethoz,  Confidence  measures  for  multimodal 
identity  verification,  Information  Fusion  3  (4)  (2002)  267-276. 


Figure 1: DET curve and dual, n = 1.0
[Panels: "DET curve, n = 1.0" and "Dual DET curve, n = 1.0"; axis labels: FAR, MR.]

Figure 2: DET curve and dual, n = 1.1
[Panels: "DET curve, n = 1.1" and "Dual DET curve, n = 1.1".]


Figure 3: DET curve and dual, n = 6.0
[Panels: "DET curve, n = 6.0" and "Dual DET curve, n = 6.0".]

Figure 4: DET curve and dual, piecewise linear
[Panels: "DET curve, piecewise linear" and "Dual DET curve, piecewise linear"; axis labels: FAR, MR.]


Figure 5: Dual DET curve versus Logistic curve
[Plot titled "Dual DET curve vs Logistic curve"; horizontal axis: score; legend: ddc, answers, logistic.]


Figure 6: DET curve, System A
[Plot titled "NIST 99 males - Loquendo"; horizontal axis: False Alarm probability (in %); vertical axis: Miss probability (in %); legend: all, good length, ok length, bad length.]


Figure 7: DET curve, System A (20 speakers)
[Plot titled "NIST 99 males - Loquendo (same 20 spkrs)"; horizontal axis: False Alarm probability (in %); vertical axis: Miss probability (in %); legend: all, good length, ok length, bad length.]


Figure 8: DET curve, System B
[Plot titled "NIST 99 males - QUT"; horizontal axis: False Alarm probability (in %); vertical axis: Miss probability (in %); legend: all, good length, ok length, bad length.]




Appendix  A  Derivation  of  the  Logistic  Curve 


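Assume that

\[ p(H_1 \mid s) = \frac{\pi_1\, p(s \mid H_1)}{\pi_0\, p(s \mid H_0) + \pi_1\, p(s \mid H_1)}, \tag{A1} \]

where, writing $x$ for the value of the score $s$,

\[ p(s \mid H_1) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(x - \mu_1)^2}{2\sigma^2} \right), \tag{A2} \]

\[ p(s \mid H_0) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(x - \mu_0)^2}{2\sigma^2} \right). \tag{A3} \]

We want to show this reduces to a logistic function

\[ p(H_1 \mid s) = \frac{1}{1 + \exp(-\beta_0 - \beta_1 x)}. \tag{A4} \]

Equations (A1)-(A3) imply that

\begin{align}
p(H_1 \mid s)
&= \frac{\pi_1 e^{-x^2/2\sigma^2} e^{2\mu_1 x/2\sigma^2} e^{-\mu_1^2/2\sigma^2}}
        {\pi_1 e^{-x^2/2\sigma^2} e^{2\mu_1 x/2\sigma^2} e^{-\mu_1^2/2\sigma^2}
         + \pi_0 e^{-x^2/2\sigma^2} e^{2\mu_0 x/2\sigma^2} e^{-\mu_0^2/2\sigma^2}} \tag{A5} \\
&= \frac{\pi_1 e^{2\mu_1 x/2\sigma^2} e^{-\mu_1^2/2\sigma^2}}
        {\pi_1 e^{2\mu_1 x/2\sigma^2} e^{-\mu_1^2/2\sigma^2}
         + \pi_0 e^{2\mu_0 x/2\sigma^2} e^{-\mu_0^2/2\sigma^2}} \tag{A6} \\
&= \frac{C_1 e^{D_1 x}}{C_1 e^{D_1 x} + C_0 e^{D_0 x}} \tag{A7} \\
&= \frac{1}{1 + (C_0/C_1)\, e^{(D_0 - D_1) x}} \tag{A8} \\
&= \frac{1}{1 + \exp(-\beta_0 - \beta_1 x)}, \tag{A9}
\end{align}

where $C_1 = \pi_1 \exp(-\mu_1^2/2\sigma^2)$, $D_1 = 2\mu_1/2\sigma^2$ and similarly for $C_0$ and $D_0$, $\beta_0 = -\log(C_0/C_1)$
and $\beta_1 = (D_1 - D_0)$. The common factor $1/(\sqrt{2\pi}\,\sigma)$ cancels in (A5) because both Gaussians share the variance $\sigma^2$.

Hence $p(H_1 \mid s)$ reduces to a logistic function, as claimed.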
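
The result can be checked numerically. The Python sketch below compares the posterior computed directly from Bayes' law with the logistic form. It assumes that the priors are carried through the derivation, which gives beta0 = log(pi1/pi0) + (mu0^2 - mu1^2)/(2 sigma^2) and beta1 = (mu1 - mu0)/sigma^2. The parameter values and function names are hypothetical.

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Density of a Normal(mu, sigma^2) distribution at x."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))


def posterior_bayes(x, pi0, pi1, mu0, mu1, sigma):
    """p(H1 | s) computed directly from Bayes' law with equal-variance Gaussians."""
    num = pi1 * gaussian_pdf(x, mu1, sigma)
    return num / (pi0 * gaussian_pdf(x, mu0, sigma) + num)


def posterior_logistic(x, pi0, pi1, mu0, mu1, sigma):
    """p(H1 | s) written as a logistic function of the score."""
    beta0 = math.log(pi1 / pi0) + (mu0 ** 2 - mu1 ** 2) / (2 * sigma ** 2)
    beta1 = (mu1 - mu0) / sigma ** 2
    return 1.0 / (1.0 + math.exp(-beta0 - beta1 * x))


# Hypothetical priors, means and common standard deviation.
params = dict(pi0=0.9, pi1=0.1, mu0=-0.2, mu1=0.3, sigma=0.15)

# Largest disagreement between the two forms over scores from -1.0 to 1.0.
max_gap = max(
    abs(posterior_bayes(k / 10, **params) - posterior_logistic(k / 10, **params))
    for k in range(-10, 11)
)
```

The two forms agree to within floating-point rounding on this grid. The agreement depends on the two Gaussians sharing the same variance; with unequal variances the quadratic terms in x no longer cancel.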




Appendix  B  Derivation  of  NCE 


We  wish  to  derive  equations  (6)-(8). 


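\begin{align}
A &= \mathbb{E}\big[\, H_1 \;?\; -\log_2 q(E) \;:\; -\log_2(1 - q(E)) \,\big] \tag{B1} \\
  &= \int_{E,\omega} p(E,\omega)\,\big[\, H_1 \;?\; -\log_2 q(E) \;:\; -\log_2(1 - q(E)) \,\big]\, d(E,\omega) \tag{B2} \\
  &= \sum_{\omega} p(\omega) \int_{E} p(E \mid \omega)\,\big[\, H_1 \;?\; -\log_2 q(E) \;:\; -\log_2(1 - q(E)) \,\big]\, dE \tag{B3} \\
  &= -\int_{E} \big[\, \pi_0\, p(E \mid H_0) \log_2(1 - q(E)) + \pi_1\, p(E \mid H_1) \log_2 q(E) \,\big]\, dE, \tag{B4}
\end{align}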

thus  establishing  equation  (6). 

To  establish  equation  (7),  one  can  use  elementary  calculus  to  show  that  the  function 

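\[ f(q) = -A \log q - B \log(1 - q), \qquad 0 \le q \le 1,\; A \ge 0,\; B \ge 0 \tag{B5} \]

is minimised when $q = A/(A + B)$; the minimiser does not depend on the base of the
logarithm. Note that the convention $0 \log 0 = 0$ is necessary when
$A = 0$ or $B = 0$. Therefore (B4) is minimised when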


\[ q(E) = \frac{\pi_1\, p(E \mid H_1)}{\pi_0\, p(E \mid H_0) + \pi_1\, p(E \mid H_1)} = p(H_1 \mid E), \tag{B6} \]

using Bayes' law, and hence we obtain (7).
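
The minimiser q = A/(A + B) can also be confirmed numerically. The following Python sketch evaluates f on a grid over the open interval (0, 1) and picks the smallest value. The weights A and B are hypothetical and strictly positive, so the 0 log 0 convention is not needed here.

```python
import math


def f(q, A, B):
    """The function of equation (B5), evaluated with natural logarithms."""
    return -A * math.log(q) - B * math.log(1.0 - q)


A, B = 0.3, 1.2  # hypothetical positive weights
grid = [i / 1000 for i in range(1, 1000)]  # points in the open interval (0, 1)

# Grid point with the smallest value of f.
q_best = min(grid, key=lambda q: f(q, A, B))
```

For these weights the grid minimum falls at q = 0.2, which equals A/(A + B) = 0.3/1.5.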

For a given data set we approximate (6) by replacing the integration with a finite
summation over the discrete set of values of $E$ that occur at least once. The
quantity $p(E \mid H_1)$ can be approximated by $N_{E,t}/N_t$, where the numerator is the number of
true-speaker trials in which the value $E$ was obtained as evidence, and the denominator is
the total number of true-speaker trials. The approximation $p(E \mid H_0) \approx N_{E,f}/N_f$ is
defined in the same way over the impostor trials.

Thus  (B4)  can  be  approximated  by 


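\begin{align}
A &\approx -\pi_0 \sum_{E} \frac{N_{E,f}}{N_f} \log_2(1 - q(E)) - \pi_1 \sum_{E} \frac{N_{E,t}}{N_t} \log_2 q(E) \tag{B7} \\
  &= -\frac{\pi_0}{N_f} \sum_{i=1}^{N_f} \log_2\big(1 - q(E_i^{f})\big) - \frac{\pi_1}{N_t} \sum_{i=1}^{N_t} \log_2 q(E_i^{t}), \tag{B8}
\end{align}

where $E_i^{f}$ and $E_i^{t}$ denote the evidence obtained in the $i$-th impostor trial and the $i$-th true-speaker trial respectively,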

and  we  obtain  equation  (8). 
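
A minimal Python sketch of the empirical estimate in (B8) is given below. It computes only the prior-weighted cross entropy A in bits; the normalisation that turns this quantity into the NCE is defined in the main text and is not reproduced here. The confidence values and the function name empirical_cross_entropy are hypothetical.

```python
import math


def empirical_cross_entropy(true_conf, impostor_conf, pi1):
    """Prior-weighted average log loss in bits, in the form of equation (B8).

    true_conf     -- confidences q(E) assigned to true-speaker trials
    impostor_conf -- confidences q(E) assigned to impostor trials
    pi1           -- prior probability of the speaker hypothesis
    """
    pi0 = 1.0 - pi1
    true_term = sum(math.log2(q) for q in true_conf) / len(true_conf)
    impostor_term = sum(math.log2(1.0 - q) for q in impostor_conf) / len(impostor_conf)
    return -pi0 * impostor_term - pi1 * true_term


# Hypothetical confidences produced by some estimator on a small trial set.
true_conf = [0.9, 0.8, 0.95]
impostor_conf = [0.1, 0.2, 0.05, 0.3]
pi1 = 0.5

a_system = empirical_cross_entropy(true_conf, impostor_conf, pi1)
# An uninformative estimator that always reports the prior.
a_prior = empirical_cross_entropy([pi1] * 3, [pi1] * 4, pi1)
```

With equal priors the uninformative estimator has a cross entropy of one bit, and any estimator that separates the two classes reasonably well scores below that. Confidences of exactly 0 or 1 would need special handling, since the logarithm is undefined there.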




Page  classification:  UNCLASSIFIED 


DEFENCE SCIENCE AND TECHNOLOGY ORGANISATION
DOCUMENT CONTROL DATA

1. CAVEAT/PRIVACY MARKING:
2. TITLE: A Confidence Estimator for Speaker Verification Using Dual DET Curves
3. SECURITY CLASSIFICATION: Document (U), Title (U), Abstract (U)
4. AUTHOR: T. C. Tao
5. CORPORATE AUTHOR: Defence Science and Technology Organisation, PO Box 1500, Edinburgh, South Australia 5111, Australia
6a. DSTO NUMBER: DSTO-RR-0358
6b. AR NUMBER: AR-014-858
6c. TYPE OF REPORT: Research Report
7. DOCUMENT DATE: October, 2010
8. FILE NUMBER: 2009/1137055/1
9. TASK NUMBER: DST 97/007
10. SPONSOR: CDS
11. No. OF PAGES: 17
12. No. OF REFS: 8
13. URL OF ELECTRONIC VERSION: http://www.dsto.defence.gov.au/corporate/reports/DSTO-RR-0358.pdf
14. RELEASE AUTHORITY: Chief, Command, Control, Communications and Intelligence Division
15. SECONDARY RELEASE STATEMENT OF THIS DOCUMENT: Approved for Public Release
    OVERSEAS ENQUIRIES OUTSIDE STATED LIMITATIONS SHOULD BE REFERRED THROUGH DOCUMENT EXCHANGE, PO BOX 1500, EDINBURGH, SOUTH AUSTRALIA 5111
16. DELIBERATE ANNOUNCEMENT: No Limitations
17. CITATION IN OTHER DOCUMENTS: No Limitations
18. DSTO RESEARCH LIBRARY THESAURUS: Speech processing; Voice recognition

19.  ABSTRACT 

In  speaker  verification,  the  result  of  a  trial  is  traditionally  summarised  as  an  arbitrary  score,  where  a 
higher  score  indicates  stronger  evidence  in  favour  of  the  speaker  hypothesis.  However  this  is  difficult  to 
interpret.  It  is  useful  to  convert  this  score  into  a  “confidence  level”,  i.e.  the  posterior  probability  that 
the  speaker  hypothesis  is  correct,  given  the  score.  One  of  the  simplest  formulae  to  obtain  a  confidence 
level  is  using  a  logistic  curve,  but  this  requires  the  assumption  that  the  true  and  impostor  speaker 
scores  are  distributed  according  to  a  Normal  distribution.  In  this  report  I  propose  a  new  formula, 
called  the  dual  Detection  Error  Trade-Off  (DET)  curve,  since  it  represents  the  same  information  as  a 
DET  curve.  This  formula  avoids  the  assumption  of  normally  distributed  target  and  impostor  scores. 
Experiments  on  the  NIST  99  data  prove  the  dual  DET  curve  performs  slightly  better  than  the  logistic 
curve. 



